SORTING TECHNOLOGY IN PROGRAMMING

In many applications, sorting or the ordering of data records con-
stitutes a large portion of the total usage of a data processing sys-
tem. This investigation of sorting technology is aimed at realizing
better approaches to the sorting of data.

An improved manner for carrying out the merging operation is 
achieved by conducting the Phase 2 portion of the program as an 
unbalanced merging operation. In this type of merging, the number 
of input tapes in phase 2 is greater than the number of output tapes. 
After each merge some of the tape assignments are changed in such 
a way that the unbalanced merge can continue. In order for unbal-
anced merging to operate properly, the distribution of sequences,
as received in phase 2, must follow a definite and predetermined
relationship. There are a number of different ways in which un-
balanced merging can be carried out. Also there are several ways
in which the sequences can be distributed prior to input to phase 2 
to arrive at the correct sequence relationship. 

This paper describes unbalanced merging, specifically the poly- 
phase merge, and indicates a means for evaluating unbalanced 
merging performance. In addition, a number of sequence re-
distribution schemes are discussed and evaluated.
SORTING TECHNOLOGY IN PROGRAMMING
by
J. W. Toner


INTRODUCTION 


In many applications, sorting or the ordering of data records constitutes a 
large portion of the total usage of a data processing system. This investigation of 
sorting technology is aimed at realizing more optimum approaches to the sorting of 
data. 


Sorting is generally accomplished by reading into the primary storage (core)
of a computer as large a portion of the total data file as is possible. These groups are then
sorted internally and written out on secondary storage (tapes or disks) in a series of
sorted sequences. This internal sort is generally designated as Phase 1 of the total sort
operation. These series of sequences are read into the machine, compared, and com- 
bined into longer sequences which are then written out again on tapes or disk(s). This 
operation of reading in, comparing, merging and writing out is repeated as many times 
as are needed to completely order the entire file and form a single sequence. This 
merging operation is generally designated as external merging in that the data (except 
for those records being compared) is external to core. The merging phase is also 
generally called Phase 2 of the total sorting operation. 
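
The two phases can be illustrated with a minimal sketch, assuming an in-memory list stands in for the tapes. The names `two_phase_sort` and `core_size` are hypothetical and introduced only for this illustration. The sketch also combines all sequences in a single merge, whereas a tape system would normally need several merging passes.

```python
import heapq

def two_phase_sort(records, core_size):
    """Sort `records` in two phases; returns (ordered list, number of sequences)."""
    # Phase 1: each group that fits in "core" is sorted internally and
    # written out as one sorted sequence.
    sequences = [sorted(records[i:i + core_size])
                 for i in range(0, len(records), core_size)]
    # Phase 2: the sequences are compared and combined into one sequence.
    ordered = list(heapq.merge(*sequences))
    return ordered, len(sequences)

ordered, n_sequences = two_phase_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], core_size=4)
```

With a core size of 4 records, the nine input records leave Phase 1 as three sorted sequences.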


Until a few years ago, the merging operation in Phase 2 was carried out in a
"balanced mode". In balanced tape merging, the data is distributed out of Phase 1
onto half of the total number of tapes available for merging. The data is passed
back and forth in the merge in a balanced fashion, i.e. data is output on the same
number of tapes as were used for input. One passage of the total file through the
machine is designated as a pass. In the merging phase (Phase 2) there can be a
series of passes, as many as needed to order the data into one sequence. In most
applications, Phase 2 consumes the bulk (70-80%) of the total time required to
perform the sort.
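
The number of passes needed by a balanced merge can be estimated with a short sketch. It assumes that each pass merges groups of sequences whose size equals half the number of tapes, and that any incomplete final group still costs a full pass. The function name and parameters are hypothetical.

```python
def balanced_passes(num_sequences, total_tapes):
    """Passes needed to reduce `num_sequences` to one with a balanced merge."""
    way = total_tapes // 2          # half the tapes are inputs, half outputs
    if way < 2:
        raise ValueError("a balanced merge needs at least 4 tapes")
    passes = 0
    while num_sequences > 1:
        # Each pass merges groups of `way` sequences (ceiling division).
        num_sequences = -(-num_sequences // way)
        passes += 1
    return passes
```

For example, 16 sequences on a 4 tape system need 4 passes under these assumptions, since each 2-way pass halves the number of sequences.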


An improved manner of carrying out the merging operation is to conduct 
Phase 2 as an unbalanced merging operation. In unbalanced tape merging the number 
of input tapes for a merging operation in Phase 2 is greater than the number of output 
tapes. After each merge some of the tape assignments are changed in such a way 
that the unbalanced merge can continue. In order for unbalanced merging to operate 
properly, the distribution of sequences as received in Phase 2 must follow a definite, 
predetermined relationship. There are a number of different ways in which unbalanced 
merging can be carried out, and there are also a number of ways in which the sequences
can be distributed before being input to Phase 2 to arrive at the correct sequence 
relationship. 


Some of the unbalanced sorting techniques that can be employed in Phase 2 are:

A. Polyphase merging (1)
B. Oscillating merging (2)
C. Cascade merging (3)
D. Compromise merging (4)


A. POLYPHASE MERGING 


Polyphase merging on a system with (K + 1) total system tapes causes (K) input 
tapes to be merged, i.e. a (K) - way merge onto a single tape. This (K) - way com- 
parison, designated as a (K) - order merge continues throughout all of Phase 2 until the 
file is sorted. 
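
A minimal sketch of the tape bookkeeping in a polyphase merge follows. It tracks only the number of sequences on each tape, assumes the starting counts form an exact polyphase distribution, and uses the hypothetical name `polyphase_steps`.

```python
def polyphase_steps(counts):
    """Simulate a polyphase merge on len(counts) + 1 tapes.

    `counts` holds the sequences on the K input tapes; the extra tape starts
    empty. Returns a list of (merges, tape counts after the step).
    """
    tapes = list(counts) + [0]
    steps = []
    while sum(tapes) > 1:
        out = tapes.index(0)                       # the empty tape takes the output
        inputs = [t for i, t in enumerate(tapes) if i != out]
        merges = min(inputs)                       # merge until one input tape empties
        if merges == 0:
            raise ValueError("counts are not an exact polyphase distribution")
        tapes = [merges if i == out else t - merges
                 for i, t in enumerate(tapes)]
        steps.append((merges, tuple(tapes)))
    return steps
```

For a 4 tape system starting with 13, 11 and 7 sequences, the first step performs 7 three-way merges and leaves 6, 4, 0 and 7 sequences on the four tapes.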


B. OSCILLATING MERGING 


Oscillating merging on a system with (K + 1) total system tapes causes (K - 1) 
input tapes to be merged, i.e. a (K - 1) - order merge onto a single tape. This (K - 1) -
way merge continues throughout all of Phase 2 until the file is sorted. The name
“oscillating” derives from the fact that the execution of Phase 1 and Phase 2 is inter- 
leaved so that the sort "oscillates" between performing Phase 1 and Phase 2. In all 
other techniques, both balanced and unbalanced, Phase 2 is not started until all the 
data has been internally sorted and Phase 1 is therefore completed, releasing the 
input tape for use in Phase 2. 


C. CASCADE MERGING


Cascade merging in a system with (K + 1) total system tapes causes (K) input
tapes to be merged; then (K - 1), (K - 2), ... down to a 2-way merge. This variation
in the order of the merge exists within each pass of the data. Each pass starts with 
(K) - way comparison until one of the input tapes becomes empty; then (K - 1) 
comparison until another input tape becomes empty; then (K - 2), etc. 


D. COMPROMISE MERGING


Compromise merging on a system with (K + 1) total system tapes causes (K),
(K - 1), ... input tapes to be merged as in Cascade Merging. In Compromise Merging,
the order of merge does not vary from (K) down to 2, but can end at some intermediate
value (K - Z) where Z is an integer and may depend on the total number of tapes on
the system, the total number of sequences, or some other sort parameter. (K - Z)
must be a positive integer.


The initial effort in the investigation of sorting technology has been directed 
towards the evaluation of the polyphase merge as applied to tape operation. Poly- 
phase merging appears to be advantageous compared to oscillating merging on systems 
equipped with a small number of tapes (3-6 tape units). In this evaluation, the 
definition of a polyphase merge will be that unbalanced merge technique in which 
the number of sequences on the merging tapes is a linear combination of the Kth
generalized Fibonacci numbers (5) (where K + 1 = the number of tapes available). 

A further assumption will be that the sequences as they are generated in Phase 1 
are all of equal length (i.e., each sequence contains an equal number of records). 


In this discussion only the relative merits of the merging phase of the total 
sort program will be considered. For unbalanced sorting procedures, (A,C,D) there is 
a redistribution pass (or a "lower order of merge" pass for procedure B) to handle the 
situation where the number of sequences is not exactly equal to an unbalanced merge 
table entry combination or where (procedure B) the number of sequences is not an 
exact power of the order of merge. 


II. UNBALANCED MERGE EVALUATION PROCEDURE


One common factor that can be used to compare various approaches to
merging is the factor "r" which is the total number of times the records are handled
(R'), divided by the total number of records being sorted (R), i.e. an average record
handling figure. For oscillating merging this can be expressed as:

$$ (K - 1)^{r} = \frac{R}{G} $$

where

$$ r = \frac{R'}{R} $$

R' = Total number of times records are handled (written).

R = Total number of records to be sorted.

G = Number of records sorted internally in Phase 1.
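
As a hedged illustration, the sketch below evaluates r for oscillating merging under the assumption that the relation reads (K - 1)^r = R/G, i.e. each record is handled once per (K - 1)-way merging level. The function and parameter names are hypothetical.

```python
import math

def oscillating_r(total_records, internal_sort_size, total_tapes):
    """Average record handling r, assuming (K - 1)**r == R / G."""
    order = total_tapes - 2          # K + 1 tapes give a (K - 1)-way merge
    if order < 2:
        raise ValueError("oscillating merging needs at least 4 tapes")
    return math.log(total_records / internal_sort_size, order)
```

With 5 tapes (a 3-way merge) and R = 27 G, this gives r of approximately 3.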


For a polyphase merge, "r" can be expressed as:

$$ r = \frac{G}{R} \sum_{i} f_i \, g_i $$

where:

$f_i$ = Length in number of sequences merged in each individual poly-
phase merge operation.

$g_i$ = Number of records (in units of G) operated upon in an individual polyphase
merge operation.


If the sequence distribution at the start of the polyphase merge is given by

Tape Number              1      2      3     ...     K     K+1
Original Distribution   C_1    C_2    C_3    ...    C_K     0

where $C_1 > C_2 > C_3 > \dots > C_K$ and the C terms are linear combinations of the Kth
generalized Fibonacci numbers, then the $f_i$ sequence can be expressed as:

First Term     = $C_K$
Second Term    = $C_{K-1} - C_K$
Third Term     = $C_{K-2} - C_{K-1}$
...
(These are the first K terms in the $f_i$ sequence.)

(K + 1) Term   = $C_K - (C_{K-1} - C_K) - (C_{K-2} - C_{K-1}) - \dots$
                 Number of terms subtracted = K - 1
(K + 2) Term   = $(C_{K-1} - C_K) - (C_{K-2} - C_{K-1}) - (C_{K-3} - C_{K-2}) - \dots$
                 Number of terms subtracted = K - 1
etc.

Any term (K + n) can be obtained by taking the Kth previous term in the sequence and
repetitively subtracting the (K - 1) following terms from it.
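
The rule for the f sequence can be sketched as follows. The sketch assumes the counts are given in decreasing order (largest first), that at least two merge steps are required, and that generation stops once two successive terms are unity. The name `f_sequence` is hypothetical.

```python
def f_sequence(c):
    """Return the f sequence for a distribution c = [C_1, ..., C_K], C_1 largest."""
    k = len(c)
    # First K terms: C_K, C_(K-1) - C_K, ..., C_1 - C_2.
    first = [c[-1]] + [c[i - 1] - c[i] for i in range(k - 1, 0, -1)]
    terms = []
    while len(terms) < 2 or terms[-2:] != [1, 1]:
        i = len(terms)
        if i < k:
            nxt = first[i]
        else:
            # Kth previous term minus the (K - 1) terms that follow it.
            nxt = terms[i - k] - sum(terms[i - k + 1:i])
        if nxt <= 0:
            raise ValueError("distribution is not covered by this sketch")
        terms.append(nxt)
    return terms
```

For the distribution 13, 11, 7 on a 4 tape system this yields 7, 4, 2, 1, 1.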


The $g_i$ sequence can be expressed for K = 3 by dropping the first K terms of
the following sequence

                        g_1   g_2   g_3   g_4   g_5
  1,   1,   1,           3,    5,    9,   17,   31,  ...
  First K terms          Start of sequence

Generally the sequence starts following K ones. Each succeeding term is the sum of
the previous (K) terms.
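
A short sketch of this rule, with the hypothetical name `g_sequence`, generates the terms that follow the K leading ones:

```python
def g_sequence(k, count):
    """Return the first `count` terms of the g sequence for a K-way merge."""
    terms = [1] * k                      # the sequence starts after K ones
    while len(terms) < k + count:
        terms.append(sum(terms[-k:]))    # each term is the sum of the previous K
    return terms[k:]
```

For K = 3 the first five terms are 3, 5, 9, 17 and 31, in agreement with the sequence shown.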


When two successive terms of the $f_i$ sequence are unity, the sequence
terminates.


Example

K + 1 = 4

Express R = 31 G

i.e., R is an exact multiple of a polyphase merge table entry combination.

Tape No.                1      2      3      4
Number of Sequences    13     11      7      0

Assume each of these sequences is of equal length = G.

$f_i$ sequence

$C_3 = 7$, $C_2 = 11$, $C_1 = 13$

$f_1 = C_3 = 7$
$f_2 = C_2 - C_3 = 4$
$f_3 = C_1 - C_2 = 2$
$f_4$ = (K + 1) Term = $C_3 - (C_2 - C_3) - (C_1 - C_2) = 1$
$f_5$ = (K + 2) Term = $(C_2 - C_3) - (C_1 - C_2) -$ (K + 1) Term $= 1$

$g_i$ sequence

                        g_1   g_2   g_3   g_4   g_5
  1,   1,   1,           3,    5,    9,   17,   31,  ...
  K = 3 ones             Start of sequence

$$ R' = G \,(f_1 g_1 + f_2 g_2 + f_3 g_3 + f_4 g_4 + f_5 g_5) = G \,(7 \times 3 + 4 \times 5 + 2 \times 9 + 1 \times 17 + 1 \times 31) $$

$$ R = (C_1 + C_2 + C_3)\, G = 31\, G $$

$$ r = \frac{R'}{R} = \frac{21 + 20 + 18 + 17 + 31}{31} = \frac{107}{31} = 3.45 $$

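
The example figure can be cross-checked by simulating the merge directly rather than by using the f and g sequences. The sketch below assumes an exact polyphase starting distribution with all initial sequences of length G, measures lengths in units of G, and uses the hypothetical name `record_handling`.

```python
from fractions import Fraction

def record_handling(counts):
    """Return r = R'/R for a polyphase merge starting from `counts`."""
    # Each tape is (number of sequences, length of each sequence in units of G).
    tapes = [(c, 1) for c in counts] + [(0, 0)]
    total = sum(counts)
    handled = 0
    while sum(n for n, _ in tapes) > 1:
        out = next(i for i, (n, _) in enumerate(tapes) if n == 0)
        inputs = [t for i, t in enumerate(tapes) if i != out]
        merges = min(n for n, _ in inputs)
        if merges == 0:
            raise ValueError("counts are not an exact polyphase distribution")
        length = sum(l for _, l in inputs)     # one sequence from every input tape
        handled += merges * length             # records written in this step
        tapes = [(merges, length) if i == out else (n - merges, l)
                 for i, (n, l) in enumerate(tapes)]
    return Fraction(handled, total)

r = record_handling([13, 11, 7])
```

For the distribution 13, 11, 7 the simulation writes 107 G records for a file of 31 G records.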

The record handling figure can be expressed more rigorously. The K-generalized
Fibonacci numbers $f_{j,K}$ are defined in (9) as

$$ f_{j,K} = 0, \qquad 0 \le j \le K - 2 $$

$$ f_{K-1,K} = 1 $$

$$ f_{j,K} = \sum_{n=1}^{K} f_{j-n,K}, \qquad j \ge K $$

In (5), a generalized polyphase merge of sequences using K + 1 tapes is defined in terms
of linear combinations of the K-generalized Fibonacci numbers where

for $j \ge (K + 1 - n)$

$$ C_{n,j,K} = \sum_{a=j-(K-n+1)}^{j-1} f_{a,K}, \qquad (1 \le n \le K) $$

where $C_{n,j,K}$ is the number of sequences on tape n at the original distribution at index
level j for K tapes.


For $j < (K + 1 - n)$


Ca,i.K 7 0 n # 1 


and Ci,jx = 0 i 


IA 
AA 


The set of the Kth generalized numbers varies on index j. For any fixed value of j there
is generated a value for $C_{n,j,K}$, the proper number of sequences on tape n for the poly-
phase merge.
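
A minimal sketch of these definitions follows. It assumes the K-generalized Fibonacci numbers start with K - 1 zeros followed by a one, and it covers only the range where j is at least K + 1 - n. The function names are hypothetical.

```python
def k_fibonacci(j, k):
    """K-generalized Fibonacci number with index j (0 for j <= K-2, 1 for j = K-1)."""
    terms = [0] * (k - 1) + [1]
    while len(terms) <= j:
        terms.append(sum(terms[-k:]))    # sum of the previous K terms
    return terms[j]

def sequences_on_tape(n, j, k):
    """Number of sequences on tape n at index level j for a K-way polyphase merge."""
    low = j - (k - n + 1)
    if low < 0:
        raise ValueError("this sketch covers only j >= K + 1 - n")
    return sum(k_fibonacci(a, k) for a in range(low, j))
```

For K = 3 and j = 7 this gives 13, 11 and 7 sequences on tapes 1, 2 and 3.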


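The total record handling (R') is the sum of the records handled at each merge
of the polyphase merge and can be expressed as

$$ R' = C_{1,j-1,K} \Big( G \sum_{n=1}^{K} C_{n,K,K} \Big) + C_{1,j-2,K} \Big( G \sum_{n=1}^{K} C_{n,K+1,K} \Big) + C_{1,j-3,K} \Big( G \sum_{n=1}^{K} C_{n,K+2,K} \Big) + \dots + \Big( G \sum_{n=1}^{K} C_{n,j,K} \Big) $$

$$ R = G \sum_{n=1}^{K} C_{n,j,K} $$

$$ r = \frac{R'}{R} = \frac{\sum_{a=1}^{j-K+1} C_{1,j-a,K} \sum_{n=1}^{K} C_{n,K+a-1,K}}{\sum_{n=1}^{K} C_{n,j,K}} $$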
The record handling figure can be expressed in terms of linear combinations of
the K-generalized Fibonacci numbers by substituting the appropriate Fibonacci
summations in the expression for "r" just given:

$$ r = \frac{\sum_{a=1}^{j-K+1} \Big( \sum_{\gamma=j-a-K}^{j-a-1} f_{\gamma,K} \Big) \sum_{n=1}^{K} \sum_{\gamma=a-2+n}^{K+a-2} f_{\gamma,K}}{\sum_{n=1}^{K} \sum_{\gamma=j-(K-n+1)}^{j-1} f_{\gamma,K}} $$

Since by definition

$$ \sum_{\gamma=j-a-K}^{j-a-1} f_{\gamma,K} = f_{j-a,K} $$

Then r can also be expressed as

$$ r = \frac{\sum_{a=1}^{j-K+1} f_{j-a,K} \sum_{n=1}^{K} \sum_{\gamma=a-2+n}^{K+a-2} f_{\gamma,K}}{\sum_{n=1}^{K} \sum_{\gamma=j-(K-n+1)}^{j-1} f_{\gamma,K}} $$
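
The Fibonacci form of r can be evaluated with a short sketch. It assumes the summation limits run from a - 2 + n to K + a - 2 in the numerator and from j - (K - n + 1) to j - 1 in the denominator, and that j is at least K. The function names are hypothetical.

```python
from fractions import Fraction

def k_fibonacci_list(count, k):
    """First `count` K-generalized Fibonacci numbers, indices 0 .. count - 1."""
    terms = [0] * (k - 1) + [1]
    while len(terms) < count:
        terms.append(sum(terms[-k:]))
    return terms[:count]

def record_handling_figure(j, k):
    """Average record handling r at index level j for a K-way polyphase merge."""
    if j < k:
        raise ValueError("this sketch assumes j >= K")
    f = k_fibonacci_list(j + 1, k)
    numerator = sum(
        f[j - a] * sum(sum(f[a - 2 + n:k + a - 1]) for n in range(1, k + 1))
        for a in range(1, j - k + 2)
    )
    denominator = sum(sum(f[j - (k - n + 1):j]) for n in range(1, k + 1))
    return Fraction(numerator, denominator)
```

For K = 3 and j = 7 (the distribution 13, 11, 7) this evaluates to 107/31.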


Figure 1 shows a plot of the average number of record handlings as a function 
of file size. File size is given in terms of C x G records where C is an integer. 





From this analysis, it appears that polyphase sorting has some advantages 
when the number of tapes is small. If a three tape system, using unbalanced merging, 
were compared with a 4 tape system using balanced techniques, the average record 
handling figure for the 3 tape system would be approximately 10% greater than for the 
4 tape system. Using this as a comparison criterion, the three tape unbalanced system 
would run approximately 10% more slowly than the 4 tape balanced system. If one 
compares 3 tape unbalanced with 4 tape unbalanced, one finds that the 3 tape would 
operate somewhere in the range of 50% of the speed of the 4 tape. This estimate is 
only approximate since the tape block sizes are not the same; 3 tape unbalanced will 
accommodate a larger block size than 4 tape unbalanced. 


Further investigation of unbalanced merging procedures is called for to 
indicate other possible areas of improvement and optimization. 


III. SEQUENCE REDISTRIBUTION SCHEMES


In unbalanced polyphase merging, some redistribution of the sequences as
they are output from Phase 1 is needed to arrive at an exact polyphase table sequence
distribution level. This redistribution can be accomplished in a variety of ways.
Approaches that will be evaluated are the redistribution schemes described in "An
Elementary Polyphase Merge Algorithm" (6), "A Generalized Polyphase Merge
Algorithm" (5) and "A Dispersion Pass Algorithm for the Polyphase Merge" (8). This
comparison will be made for a three tape system (where K + 1 = the number of tapes
available = 3). A further assumption will be that the sequences as they are generated
in Phase 1 are all of equal length (i.e., each sequence contains an equal number of
records).


A. ELEMENTARY POLYPHASE ALGORITHM 


The redistribution scheme described in (6) was developed for a four tape 
system and has been modified and applied to a three tape system for this comparison. 
This analysis has already been described (7) and only the results will be included for 
this comparison. In this operation, the sequences from Phase 1 are output on 2 tapes 
alternately (the same as for balanced 2 way sort operation). The redistribution phase 
redistributes these sequences to a correct Polyphase table value. The following terms
are used in this evaluation: 


a + K'      The number of sequences of sorted records on tape 1 prior to redistribution.

a           The number of sequences of sorted records on tape 2 prior to redistribution.

K'          The number, either zero or one, of sequences by which the number of
            sequences on tape 1 differs from the number of sequences on tape 2
            prior to redistribution.

S           The total number of sequences to be merged prior to redistribution.

N           The total number of sequences to be merged after redistribution.

a_1, b_1    The number of sequences on tapes 1 and 2 respectively after
            redistribution.

p1          The number of sequences passed from tape 3 to tape 1.

p2          The number of sequences passed from tape 3 to tape 2.

X'          The number of sequences merged from tapes 1 and 2 onto tape 3.


The redistribution is accomplished by:

1. A 2 way merge of X' sequences from tapes 1 and 2 to 3.
2. A pass of p1 sequences from tape 3 to tape 1.
3. A pass of p2 sequences from tape 3 to tape 2.


Redistribution Phase

Total     Tape 1                  Tape 2             Tape 3
S         a + K'                  a                  0
          a + K' - X'             a - X'             X'
          a + K' - X' + p1        a - X'             X' - p1
          a + K' - X' + p1        a - X' + p2        X' - p1 - p2
N         = a_1                   = b_1              = 0


In this redistribution the following equations apply:

$$ S = 2a + K' $$
$$ N = a_1 + b_1 $$
$$ a_1 = a + K' - X' + p1 $$
$$ b_1 = a - X' + p2 $$
$$ X' - p1 - p2 = 0 $$

From the above equations, the following equations may be derived:

$$ X' = S - N $$
$$ p1 = a_1 + X' - (a + K') $$
$$ p2 = b_1 + X' - a $$
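
The derived equations can be sketched as a small planning routine. It assumes the target values after redistribution are supplied by the caller and rejects plans in which any quantity would be negative. The function and parameter names are hypothetical.

```python
def redistribution_plan(s, a1, b1):
    """Return (X', p1, p2) for S sequences and target counts a1, b1."""
    a, k_prime = divmod(s, 2)        # S = 2a + K', with K' equal to 0 or 1
    n = a1 + b1                      # N = a1 + b1
    x = s - n                        # X' = S - N
    p1 = a1 + x - (a + k_prime)      # sequences passed from tape 3 to tape 1
    p2 = b1 + x - a                  # sequences passed from tape 3 to tape 2
    if min(x, p1, p2, a - x) < 0:
        raise ValueError("target distribution cannot be reached from S sequences")
    return x, p1, p2
```

For S = 24 and a target of 8 and 5 sequences, 11 sequences are merged onto tape 3 and then 7 and 4 are passed back to tapes 1 and 2.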


A further restriction that must be applied is that 


$S \le 2N$ (This determines the maximum value of N for any value of S).




From this analysis one can determine the amount of record handling that is 
required in the redistribution phase. If the sequences from Phase 1 are of length G, 
then R' (Number of record handlings) is given by 


R' (7)=4(S-N)G 
R' (7) is the number of record handlings for algorithm (Reference 7). 
The resulting record handlings as a function of file size are shown in the 


accompanying graph (Figure 2). At values of S one higher than 2 x N the value of 
S - N is a minimum which accounts for the steep slopes in the plots. 


Example (table values N = 13 and N = 21):

               S       N      S - N     R' (7) = 4(S - N) G
              24      13       11       44G
              25      13       12       48G
S = 2N        26      13       13       52G
              27      21        6       24G
              28      21        7       28G


As S goes from 26 to 27, R' changes from 52G to 24G. 
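
The example values can be reproduced with a short sketch. It assumes that for three tapes the table totals N are the Fibonacci numbers 2, 3, 5, 8, 13, 21, ... and that N is taken as the smallest table total for which S does not exceed 2N; both are inferences from the example rather than statements from (7). The function name is hypothetical.

```python
def elementary_handlings(s):
    """Return (N, R'(7) in units of G) for S sequences on a three tape system."""
    prev, n = 1, 2
    while 2 * n < s:                 # move to the next table total until S <= 2N
        prev, n = n, prev + n
    return n, 4 * (s - n)            # R'(7) = 4 (S - N) G
```

The drop from 52G to 24G appears where S passes from 26 to 27 and N moves from 13 to 21.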


B. GENERALIZED POLYPHASE ALGORITHM 


For the Generalized Polyphase Algorithm, the redistribution will begin
with either an excess of sequences (X) on tape 1 or an excess on tape 2.

Case I

Tape No.       1             2             3
               C_1 + X       C_2           0
               C_1           C_2 - X       X
               C_1           0             C_2

Case II

Tape No.       1             2             3
               C_1 + C_2     C_2 + X       0
               C_1 - X       0             C_2 + X
               C_2           0             C_1

For Case I
    R' = (C_2 + X) G
For Case II
    R' = (C_1 + C_2 + X) G
When X = 0
    R' = 0 for Case I
    R' = 2 C_2 G for Case II

When X = 0, the full algorithm need not be accomplished; hence, R' is defined as
shown.
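
The two cases can be sketched as tape states plus a record handling count. The split of each case into a 2 way merge (each merged sequence has length 2G) followed by a copy of the remaining sequences is an inference from the tape tables, not a statement from (5). The function name is hypothetical, and R' is returned in units of G.

```python
def generalized_redistribution(c1, c2, x, case):
    """Return (tape states, R' in units of G) for Case I (1) or Case II (2)."""
    if case == 1:
        if x == 0:
            return [(c1, c2, 0)], 0              # already a table value
        states = [(c1 + x, c2, 0), (c1, c2 - x, x), (c1, 0, c2)]
        handled = 2 * x + (c2 - x)               # merge X, then copy C2 - X
    else:
        if x == 0:
            return [(c1 + c2, c2, 0), (c1, 0, c2)], 2 * c2   # merge only
        states = [(c1 + c2, c2 + x, 0), (c1 - x, 0, c2 + x), (c2, 0, c1)]
        handled = 2 * (c2 + x) + (c1 - x - c2)   # merge C2 + X, then copy the rest
    return states, handled
```

With table values 8 and 5, one excess sequence costs 6G record handlings in Case I and 14G in Case II.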


C. DISPERSION PASS ALGORITHM


For the "dispersion pass algorithm" described in (8), the sequences are built
up in such a manner that the largest number of sequences at any polyphase table entry
value ($C_1$) remains constant as the sequences on the remaining tape(s) $C_2$, $C_3$, etc.,
are increased alternately to attain the next table values. Using the same notation as
was used in (5), we can conclude for a 3 tape case:

$C_1$ at level i of polyphase table is to remain constant until level i + 1 is
reached.

$C_2$ at level i is to be incremented by m (Reference 8: Case 3) until level i + 1
is reached,

where

m is the number of sequences that $C_2$ has been incremented towards the (i+1)
table level.
Algorithm (8)

1. Copy ($C_1$ - m) sequences from "$C_1$" tape onto empty tape No. 3.
2. 2 way merge m sequences from "$C_1$" and "$C_2$" tapes onto No. 3.

For this algorithm:

$$ R' = (C_1 + m)\, G $$

Since

$$ C_1 + C_2 + m = S $$

$$ R'(8) = (S - C_2)\, G $$

In (5), starting from any table value, $C_1$ is always incremented first until
the next level ($C_1 + C_2$) is attained.

Algorithm (5)

For Case I

$$ R' = (C_2 + X)\, G $$

where X is number of sequences $C_1$ has been incremented towards the (i+1) table
level.

For this range

$$ R'(5) < R'(8) $$

R' (5) is the number of record handlings for algorithm (Reference 5).
R' (8) is the number of record handlings for algorithm (Reference 8).

For the range (Case II in (5)):

$C_1$ in (5) has been incremented to the next level ($C_1 + C_2$) and X is now
the number of sequences $C_2$ has been incremented towards the (i+1) table level.
In (8) this intermediate level (Case II) is not reached as m increases uniformly
from 0 to ($C_1$ - 1).

For Case II range in (5) then

$$ X + C_2 = m $$

Hence

$$ R'(8) = (C_1 + m)\, G = (C_1 + C_2 + X)\, G $$

which is the same as in (5).

$$ \therefore R'(5) = R'(8) $$


Algorithm (5)

Case I:   R' = (C_2 + X);  X = 0, R' = 0
Case II:  R' = (C_1 + C_2 + X);  X = 0, R' = 2 C_2

Number of
Sequences     C_1     C_2     Type     X      R'
   13           8       5      I       0       0
   14           9       5      I       1       6
   15          10       5      I       2       7
   16          11       5      I       3       8
   17          12       5      I       4       9
   18          13       5      II      0      10
   19          13       6      II      1      14
   20          13       7      II      2      15
   21          13       8      I       0       0


Algorithm (8)

R' = (C_1 + m);  m = 0, R' = 0

Number of                                      Range
Sequences     C_1     C_2      m      R'       in (5)
   13           8       5      0       0       Case I
   14           8       6      1       9       Case I
   15           8       7      2      10       Case I
   16           8       8      3      11       Case I
   17           8       9      4      12       Case I
   18           8      10      5      13       Case II
   19           8      11      6      14       Case II
   20           8      12      7      15       Case II
   21           8      13      0       0

For the Case II range, C_2 = 5 and X + C_2 = m. R' is given in units of G.


A plot of R' as a function of file size for these redistribution schemes is shown
in Figure 2.
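
The comparison of the two schemes can be reproduced with a short sketch. It assumes a three tape system with level i table values C1 and C2, takes R' in units of G, and treats the first C2 increments as Case I of (5) and the remainder as Case II. The function name is hypothetical.

```python
def compare_redistribution(c1, c2):
    """Return rows (S, R'(5), R'(8)) for m = 0 .. C1 - 1 above the table value."""
    rows = []
    for m in range(c1):
        s = c1 + c2 + m
        r8 = c1 + m if m else 0              # algorithm (8): R' = (C1 + m) G
        if m < c2:                           # Case I of (5): X = m
            r5 = c2 + m if m else 0
        else:                                # Case II of (5): X = m - C2
            x = m - c2
            r5 = c1 + c2 + x if x else 2 * c2
        rows.append((s, r5, r8))
    return rows

rows = compare_redistribution(8, 5)
```

For table values 8 and 5 the rows cover 13 to 20 sequences, and in every row the figure for (5) does not exceed the figure for (8).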


The redistribution scheme proposed in (5) requires fewer record handlings than
the scheme proposed in (6) modified for 3 tapes. Furthermore, as the number of sequences
increases, (6) requires proportionally more record handling than does (5). Redistribution
scheme (5) requires no more record handlings than does (8).


There are a number of different redistribution schemes that could be investi-
gated and evaluated as to their relative efficiency. In addition, the mode of outputting
from Phase 1 affects the type of redistribution phase needed and must also be considered 
in evaluating the total sequence distribution and redistribution. 